About this Notebook

This notebook demonstrates the experience of using ML Workbench to create a machine learning model for text classification and set it up for online prediction. It is the "cloud run" version of the previous notebook: preprocessing, training, and batch prediction are all done in the cloud with various services. Cloud runs can be distributed, so they can handle very large datasets. With the small demo data used here there is little benefit, but the purpose is to demonstrate the cloud run mode of ML Workbench.

There are only a few things that need to change between "local run" and "cloud run":

  • all data sources or file paths must be on GCS.
  • the --cloud flag must be set for each step.
  • "cloud_config" can be set for cloud specific settings, such as project_id, machine_type. In some cases it is required.

Other than that, nothing else changes from local to cloud. A brief sketch of the difference is shown below.
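For example, the training step later in this notebook looks like this in cloud mode; a local run of the same step would simply drop the --cloud flag and the cloud_config block, and point its paths (and the paths in the referenced dataset) at local files instead of GCS:

%%ml train --cloud
output: gs://datalab-mlworkbench-20newslab/train
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs_transformed
model_args:
    model: linear_classification
    top-n: 5
cloud_config:
    scale_tier: BASIC
    region: us-central1
    runtime_version: '1.2'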

If you have any feedback, please send it to datalab-feedback@google.com.

Validate Data

This notebook assumes you have run the previous notebook (Text classification with MLWorkbench (small dataset experience)), so the cleaned data is already saved locally.


In [2]:
# Make sure you have the processed data there.
!ls ./data


eval.csv  train.csv  vocab.txt
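As a quick sanity check (a minimal sketch, assuming the cleaned files follow the two-column news_label,text layout declared in the dataset schema below), you can peek at the first few rows:

import csv

# Print the label and the beginning of the text for the first few training rows.
with open('./data/train.csv') as f:
    for i, row in enumerate(csv.reader(f)):
        print('%s | %s' % (row[0], row[1][:60]))
        if i >= 2:
            break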

Create Model with ML Workbench

The MLWorkbench Magics are a set of Datalab commands that provide an easy, code-free experience for training, deploying, and predicting with ML models. This notebook takes the cleaned data from the previous notebook and builds a text classification model. The MLWorkbench Magics are a collection of magic commands, one for each step in an ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.

For details of each command, run with --help. For example, "%%ml train --help".

This notebook shows the cloud version of every command, which is the normal experience when building models on large datasets. However, we will still use the small 20 newsgroups data.

Setup: Move the data to GCS

The CSV files, and all other input files to the MLWorkbench magics, must exist on GCS first. Therefore the first step is to create a new GCS bucket and copy the local CSV files to it.


In [4]:
!gsutil mb gs://datalab-mlworkbench-20newslab


Creating gs://datalab-mlworkbench-20newslab/...

In [5]:
!gsutil -m cp ./data/train.csv ./data/eval.csv gs://datalab-mlworkbench-20newslab


Copying file://./data/train.csv [Content-Type=text/csv]...
Copying file://./data/eval.csv [Content-Type=text/csv]...
\ [2/2 files][ 11.7 MiB/ 11.7 MiB] 100% Done                                    
Operation completed over 2 objects/11.7 MiB.                                     

In [1]:
import google.datalab.contrib.mlworkbench.commands  # This loads the '%%ml' magics



In [7]:
%%ml dataset create
name: newsgroup_data_gcs
format: csv
schema:
  - name: news_label
    type: STRING
  - name: text
    type: STRING  
train: gs://datalab-mlworkbench-20newslab/train.csv
eval: gs://datalab-mlworkbench-20newslab/eval.csv


Step 1: Analyze

In a cloud run, analysis is implemented with BigQuery. Running it may incur some costs.


In [8]:
%%ml analyze --cloud
output: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs
features:
    news_label:
        transform: target
    text:
        transform: bag_of_words


Analyzing column news_label...
column news_label analyzed.
Analyzing column text...
Updated property [core/project].
column text analyzed.
Updated property [core/project].
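If you want to inspect what the analysis produced (the exact file names may vary with the ML Workbench version), you can list the output folder:

!gsutil ls gs://datalab-mlworkbench-20newslab/analysis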

Step 2: Transform

In a cloud run, transformation is implemented with Cloud Dataflow. Running it may incur some costs.


In [ ]:
!gsutil -m rm -rf gs://datalab-mlworkbench-20newslab/transform # Delete previous results if any.

In [12]:
%%ml transform --cloud
output: gs://datalab-mlworkbench-20newslab/transform
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs


/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
running sdist
running egg_info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md

running check
warning: check: missing required meta-data: url

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-1.0.0/trainer.egg-info
Writing trainer-1.0.0/setup.cfg
Creating tar archive
removing 'trainer-1.0.0' (and everything under it)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==2.0.0
  Using cached google-cloud-dataflow-2.0.0.tar.gz
  Saved /tmp/tmps1gmKR/google-cloud-dataflow-2.0.0.tar.gz
Successfully downloaded google-cloud-dataflow
View job at https://console.developers.google.com/dataflow/job/2017-10-19_14_46_14-13866291315260581735?project=bradley-playground
/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
running sdist
running egg_info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md

running check
warning: check: missing required meta-data: url

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-1.0.0/trainer.egg-info
Writing trainer-1.0.0/setup.cfg
Creating tar archive
removing 'trainer-1.0.0' (and everything under it)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==2.0.0
  Using cached google-cloud-dataflow-2.0.0.tar.gz
  Saved /tmp/tmpUNLeK7/google-cloud-dataflow-2.0.0.tar.gz
Successfully downloaded google-cloud-dataflow
View job at https://console.developers.google.com/dataflow/job/2017-10-19_14_46_24-7522432058133585353?project=bradley-playground

Click the links in the output cell to monitor the jobs' progress. Once they are completed (usually within 15 minutes, including job startup overhead), check the output.


In [13]:
!gsutil ls gs://datalab-mlworkbench-20newslab/transform


gs://datalab-mlworkbench-20newslab/transform/errors_eval-00000-of-00001.txt
gs://datalab-mlworkbench-20newslab/transform/errors_train-00000-of-00001.txt
gs://datalab-mlworkbench-20newslab/transform/eval-00000-of-00003.tfrecord.gz
gs://datalab-mlworkbench-20newslab/transform/eval-00001-of-00003.tfrecord.gz
gs://datalab-mlworkbench-20newslab/transform/eval-00002-of-00003.tfrecord.gz
gs://datalab-mlworkbench-20newslab/transform/train-00000-of-00003.tfrecord.gz
gs://datalab-mlworkbench-20newslab/transform/train-00001-of-00003.tfrecord.gz
gs://datalab-mlworkbench-20newslab/transform/train-00002-of-00003.tfrecord.gz
gs://datalab-mlworkbench-20newslab/transform/tmp/
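The errors_* files listed above appear to record rows that failed transformation; a quick check (a sketch, assuming that interpretation) is to confirm they are empty:

!gsutil cat gs://datalab-mlworkbench-20newslab/transform/errors_* | wc -l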

In [2]:
%%ml dataset create
name: newsgroup_data_gcs_transformed
format: transformed
train: gs://datalab-mlworkbench-20newslab/transform/train-*
eval: gs://datalab-mlworkbench-20newslab/transform/eval-*


Step 3: Training

In a cloud run, training is implemented with the Cloud ML Engine training service. Running it may incur some costs.


In [ ]:
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!gsutil -m rm -fr gs://datalab-mlworkbench-20newslab/train

Note that "runtime_version: '1.2'" specifies which TensorFlow version is used for training. The first training run is a bit slower because of warm-up; if you train multiple times, the runs after the first will be faster.


In [3]:
%%ml train --cloud
output: gs://datalab-mlworkbench-20newslab/train
analysis: gs://datalab-mlworkbench-20newslab/analysis
data: newsgroup_data_gcs_transformed
model_args:
    model: linear_classification
    top-n: 5
cloud_config:
    scale_tier: BASIC
    region: us-central1
    runtime_version: '1.2'


Job "trainer_task_171019_223905" submitted.

Click here to view cloud log.

TensorBoard was started successfully with pid 589. Click here to access it.


In [4]:
# Once training is done, check the output.
!gsutil list gs://datalab-mlworkbench-20newslab/train


gs://datalab-mlworkbench-20newslab/train/schema_without_target.json
gs://datalab-mlworkbench-20newslab/train/evaluation_model/
gs://datalab-mlworkbench-20newslab/train/model/
gs://datalab-mlworkbench-20newslab/train/staging/
gs://datalab-mlworkbench-20newslab/train/train/
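The train/model directory holds the exported TensorFlow SavedModel used for prediction and deployment below; you can list it to confirm the export (a SavedModel directory normally contains a saved_model.pb file and a variables/ folder):

!gsutil ls gs://datalab-mlworkbench-20newslab/train/model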

Step 4: Evaluation using batch prediction

See the previous notebook (Text Classification --- 20NewsGroup (small data)) for details. You can run local batch prediction with a model trained in the cloud.


In [2]:
%%ml batch_predict
model: gs://datalab-mlworkbench-20newslab/train/evaluation_model
output: gs://datalab-mlworkbench-20newslab/prediction
format: csv
data:
  csv: gs://datalab-mlworkbench-20newslab/eval.csv


local prediction...
INFO:tensorflow:Restoring parameters from gs://datalab-mlworkbench-20newslab/train/evaluation_model/variables/variables
done.

In [2]:
!gsutil ls gs://datalab-mlworkbench-20newslab/prediction/


gs://datalab-mlworkbench-20newslab/prediction/
gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv
gs://datalab-mlworkbench-20newslab/prediction/predict_results_schema.json
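To see which columns the prediction results contain, you can print the schema file that was written alongside the results:

!gsutil cat gs://datalab-mlworkbench-20newslab/prediction/predict_results_schema.json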

In [5]:
%%ml evaluate confusion_matrix --plot
size: 15
csv: gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv



In [6]:
%%ml evaluate accuracy
csv: gs://datalab-mlworkbench-20newslab/prediction/predict_results_eval.csv


Out[6]:
accuracy count target
0 0.445141 319 alt.atheism
1 0.665810 389 comp.graphics
2 0.606599 394 comp.os.ms-windows.misc
3 0.553571 392 comp.sys.ibm.pc.hardware
4 0.600000 385 comp.sys.mac.hardware
5 0.673418 395 comp.windows.x
6 0.792308 390 misc.forsale
7 0.669192 396 rec.autos
8 0.728643 398 rec.motorcycles
9 0.836272 397 rec.sport.baseball
10 0.817043 399 rec.sport.hockey
11 0.623737 396 sci.crypt
12 0.552163 393 sci.electronics
13 0.676768 396 sci.med
14 0.700508 394 sci.space
15 0.701005 398 soc.religion.christian
16 0.620879 364 talk.politics.guns
17 0.670213 376 talk.politics.mideast
18 0.412903 310 talk.politics.misc
19 0.243028 251 talk.religion.misc
20 0.641264 7532 _all

Prediction

Local Instant Prediction

Local instant prediction also works with a model trained in the cloud.


In [7]:
%%ml predict
model: gs://datalab-mlworkbench-20newslab/train/model
data:
  - nasa
  - windows xp


predicted predicted_2 predicted_3 predicted_4 predicted_5 probability probability_2 probability_3 probability_4 probability_5 text
sci.space rec.motorcycles comp.graphics rec.sport.baseball rec.autos 0.099317 0.063897 0.062198 0.061433 0.056892 nasa
comp.os.ms-windows.misc comp.graphics comp.windows.x misc.forsale rec.motorcycles 0.153955 0.068175 0.063610 0.062289 0.056323 windows xp

Deploying Model to ML Engine

The name below follows ML Engine's "model.version" convention: "newsgroup" is the model and "alpha" is the version. (The clean-up section at the end deletes the version and the model separately.)


In [8]:
%%ml model deploy
name: newsgroup.alpha
path: gs://datalab-mlworkbench-20newslab/train/model


Waiting for operation "projects/bradley-playground/operations/create_newsgroup_alpha-1508459499009"
Done.

Batch Prediction


In [9]:
# Let's create a CSV file from eval.csv by removing the target column.
# The label is everything before the first comma, so split only once to
# preserve any commas inside the text itself.
with open('./data/eval.csv', 'r') as f, open('./data/test.csv', 'w') as fout:
    for l in f:
        fout.write(l.split(',', 1)[1])



In [12]:
!gsutil cp ./data/test.csv gs://datalab-mlworkbench-20newslab/test.csv


Copying file://./data/test.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/4.2 MiB.                                      

In [13]:
%%ml batch_predict --cloud
model: newsgroup.alpha
output: gs://datalab-mlworkbench-20newslab/test
format: json
data:
    csv: gs://datalab-mlworkbench-20newslab/test.csv
cloud_config:
    region: us-central1


Job "prediction_171020_003440" submitted.

Click here to view cloud log.

Once the job is completed, take a look at the results.


In [14]:
!gsutil ls -lh gs://datalab-mlworkbench-20newslab/test


      42 B  2017-10-20T00:38:10Z  gs://datalab-mlworkbench-20newslab/test/prediction.errors_stats-00000-of-00001
 44.82 KiB  2017-10-20T00:38:14Z  gs://datalab-mlworkbench-20newslab/test/prediction.results-00000-of-00002
293.54 KiB  2017-10-20T00:38:14Z  gs://datalab-mlworkbench-20newslab/test/prediction.results-00001-of-00002
TOTAL: 3 objects, 346523 bytes (338.4 KiB)

In [15]:
!gsutil cat gs://datalab-mlworkbench-20newslab/test/prediction.results* | head -n 2


{"probability": 0.3283481001853943, "probability_5": 0.05025269091129303, "probability_4": 0.0888153463602066, "predicted": "talk.politics.guns", "probability_3": 0.08881844580173492, "probability_2": 0.10015636682510376, "predicted_2": "talk.politics.misc", "predicted_3": "talk.politics.mideast", "predicted_4": "rec.motorcycles", "predicted_5": "talk.religion.misc"}
{"probability": 0.9899819493293762, "probability_5": 0.00031947894603945315, "probability_4": 0.0009666418773122132, "predicted": "sci.space", "probability_3": 0.0011659186566248536, "probability_2": 0.006039201747626066, "predicted_2": "talk.politics.misc", "predicted_3": "sci.crypt", "predicted_4": "talk.politics.guns", "predicted_5": "comp.os.ms-windows.misc"}

Prediction from a Python client

See the previous notebook in this sequence for the example.
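For reference, here is a rough sketch of such a client call using the Google API Python client. The project ID is a placeholder, the model and version names match the deployment above, and the exact instance format depends on the model's serving signature; the previous notebook has the authoritative example.

from googleapiclient import discovery

# Build a Cloud ML Engine client and call the deployed model version.
ml = discovery.build('ml', 'v1')
name = 'projects/YOUR-PROJECT-ID/models/newsgroup/versions/alpha'
response = ml.projects().predict(
    name=name,
    body={'instances': ['nasa', 'windows xp']}).execute()
print(response)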

Clean up

This section is optional. We will delete all the GCP resources and local files created in this sequence of notebooks. If you are not ready to delete anything, don't run any of the following cells.


In [16]:
%%ml model delete
name: newsgroup.alpha


Waiting for operation "projects/bradley-playground/operations/delete_newsgroup_alpha-1508460012953"
Done.

In [ ]:
%%ml model delete
name: newsgroup

In [ ]:
# Delete the files in the GCS bucket, and delete the bucket
!gsutil -m rm -r gs://datalab-mlworkbench-20newslab
